Taming the zoo - about algorithms implementation in the ecosystem of Apache Hadoop

Authors

  • Piotr Jan Dendek
  • Artur Czeczko
  • Mateusz Fedoryszak
  • Adam Kawa
  • Piotr Wendykier
  • Lukasz Bolikowski
Abstract

Content Analysis System (CoAnSys) is a research framework for mining scientific publications using Apache Hadoop. This article describes the algorithms currently implemented in CoAnSys, including classification, categorization and citation matching of scientific publications. The size of the input data places these algorithms among big data problems, which can be solved efficiently on Hadoop clusters.

Similar resources

Feedback - Study and Improvement of the Random Forest of the Mahout library in the context of marketing data of Orange

In the realm of Big Data systems, Hadoop has emerged as one of the most popular systems and a very diverse ecosystem has grown around it, meeting all kinds of functional and technical needs. One niche that should have been a place of choice in this ecosystem is data analytics: first because getting value out of large datasets requires efficient Machine Learning (ML) algorithms, second because l...

Object-Tagged RBAC Model for the Hadoop Ecosystem

The Hadoop ecosystem provides a highly scalable, fault-tolerant and cost-effective platform for storing and analyzing a variety of data formats. Apache Ranger and Apache Sentry are two predominant frameworks used to provide authorization capabilities in the Hadoop ecosystem. In this paper we present a formal multi-layer access control model (called HeAC) for the Hadoop ecosystem, as an academic-style abstrac...

A BigBench Implementation in the Hadoop Ecosystem

BigBench is the first proposal for an end to end big data analytics benchmark. It features a rich query set with complex, realistic queries. BigBench was developed based on the decision support benchmark TPC-DS. The first proof-of-concept implementation was built for the Teradata Aster parallel database system and the queries were formulated in the proprietary SQL-MR query language. To test oth...

Hadoop Block Placement Policy for Different File Formats

Nowadays, petabytes of data have become the norm in industry. Handling and analyzing such big data is a challenging task. Although frameworks like Hadoop (an open-source implementation of the MapReduce paradigm) and NoSQL databases like Cassandra and HBase can be used to analyze and store such large data, heterogeneity of data is still an issue. Data centers usually have clusters formed using heterogeneous nodes...

Apache Pig's Optimizer

Apache Pig allows users to describe dataflows to be executed in Apache Hadoop. The distributed nature of Hadoop, as well as its execution paradigms, provide many execution opportunities as well as impose constraints on the system. Given these opportunities and constraints Pig must make decisions about how to optimize the execution of user scripts. This paper covers some of those optimization ch...

Journal:
  • CoRR

Volume: abs/1303.5367

Publication date: 2013